Developer Release Note - Text Encoding Converter Manager 1.4
(August 2, 1998; updated September 22, 1998 - P. Edberg)
Version 1.4 of the Text Encoding Converter Manager (TEC) is included with Mac OS 8.5. This note describes changes from TEC 1.3, including the single bug fix in TEC 1.3.1.
1. Interface file changes
These are in Universal Interfaces 3.2, and will be in the interfaces included with CodeWarrior Pro 4.
a) Added constant kTextEncodingUnicodeV2_1 (TextCommon.h) to indicate Unicode version 2.1. TEC 1.4 and later treat Unicode 2.0 as if it were Unicode 2.1, so the constant kTextEncodingUnicodeV2_1 has the same numeric value (0x0103) as the constant kTextEncodingUnicodeV2_0. TEC versions earlier than 1.4 do not support Unicode 2.1.
b) Added constant kTextEncodingMacUnicode (TextCommon.h), numeric value 0x007E. This is a meta-value, like kTextEncodingUnicodeDefault, and TEC handles it similarly: it resolves kTextEncodingMacUnicode to an actual Unicode version, currently kTextEncodingUnicodeV2_1.
Beginning in Mac OS 8.5, the set of Mac OS script codes has been extended for some OS components to include Unicode. Some of these components have only 7 bits available for script code, so the constant kTextEncodingUnicodeDefault (0x0100) could not be used to indicate Unicode. Instead, kTextEncodingMacUnicode is used to indicate Unicode handled as a special Mac OS script code.
For example, kTextEncodingMacUnicode can be used to indicate Unicode in the 7-bit script code field of a Unicode input method's ComponentDescription.componentFlags field; it can also be used to indicate Unicode in the 16-bit script code field of an AppleEvent's typeIntlWritingCode text tag.
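The 7-bit constraint described above can be checked with a small sketch. The constant values are the ones given in this note; the helper FitsIn7BitScriptField is hypothetical and is not part of TextCommon.h. The constants are redefined locally so the fragment compiles without the Universal Interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Values as given in this note (TextCommon.h); redefined locally for the sketch. */
enum {
    kTextEncodingUnicodeDefault = 0x0100,
    kTextEncodingMacUnicode     = 0x007E
};

/* Hypothetical helper: returns 1 if `code` can be stored in a 7-bit field. */
static int FitsIn7BitScriptField(uint32_t code)
{
    return (code & ~(uint32_t)0x7F) == 0;
}
```

With these values, kTextEncodingMacUnicode fits in a 7-bit script code field and kTextEncodingUnicodeDefault does not, which is why the separate meta-value was needed.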
c) Added constants for TextEncodingVariant values that apply to kTextEncodingMacRoman (TextCommon.h). These are a consequence of the fact that the standard Mac OS Roman encoding has changed with Mac OS 8.5: The code point 0xDB, which was used for CURRENCY SIGN in earlier versions of Mac OS Roman, is now used for EURO SIGN. The relevant TEC changes are described in more detail in section 3c.
- kMacRomanStandardVariant: The standard variant of Mac OS Roman for Mac OS 8.5 and later; 0xDB is EURO SIGN.
- kMacRomanCurrencySignVariant: The variant of Mac OS Roman used before Mac OS 8.5, still used for some older fonts even in Mac OS 8.5; 0xDB is CURRENCY SIGN.
d) Added constants for more TextEncodingBase values (TextCommon.h). The corresponding encodings are not supported in TEC 1.4, but will be supported in a future TEC version:
    kTextEncodingMacCeltic         = 0x27    // Modified MacRoman (supports Welsh)
    kTextEncodingMacGaelic         = 0x28    // Modified MacRoman (Irish with dots)
    kTextEncodingMacInuit          = 0xEC    // For Nunavut province of Canada
    kTextEncodingISOLatin3         = 0x0203  // ISO 8859-3
    kTextEncodingISOLatin4         = 0x0204  // ISO 8859-4
    kTextEncodingWindowsVietnamese = 0x0508  // Windows code page 1258
e) Added constants for new conversion control flags for the iControlFlags parameter of ConvertFromTextToUnicode, ConvertFromUnicodeToText, etc. (UnicodeConverter.h):
- kUnicodeForceASCIIRangeBit: If an encoding normally treats one-byte code points 0x00-0x7F as an ISO 646 national variant that is different from ASCII, setting this bit will force 0x00-0x7F to be treated as ASCII. For example, Japanese encodings such as Shift-JIS generally treat 0x00-0x7F as JIS Roman, with 0x5C as YEN SIGN instead of REVERSE SOLIDUS, but when converting a DOS file path you may want to always map 0x5C as REVERSE SOLIDUS.
- kUnicodeNoHalfwidthCharsBit: Japanese encodings such as Shift-JIS and EUC-JP include a set of halfwidth katakana characters derived from JIS X0201 (0xA1-0xDF in Shift-JIS, 0x8EA1-0x8EDF in EUC-JP). Setting this bit will treat these encodings as if they did not include the halfwidth katakana; the corresponding code points will be unmappable.
    kUnicodeForceASCIIRangeBit   = 9
    kUnicodeNoHalfwidthCharsBit  = 10
    kUnicodeForceASCIIRangeMask  = 1L << kUnicodeForceASCIIRangeBit
    kUnicodeNoHalfwidthCharsMask = 1L << kUnicodeNoHalfwidthCharsBit
Deprecated the constants for the never-supported TextEncodingVariant options that were previously intended to be used for implementing this capability (TextCommon.h): kJapaneseNoOneByteKanaOption, kJapaneseUseAsciiBackslashOption.
See section 3a below.
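As an illustration of how the two new masks combine into an iControlFlags value, here is a minimal sketch. The bit numbers are the ones listed in 1e; the helper FlagsForDOSPath and its parameters are hypothetical, and the constants are redefined locally so the fragment compiles without UnicodeConverter.h.

```c
#include <assert.h>
#include <stdint.h>

/* Bit numbers as listed in this note (UnicodeConverter.h); redefined locally. */
enum {
    kUnicodeForceASCIIRangeBit  = 9,
    kUnicodeNoHalfwidthCharsBit = 10
};
#define kUnicodeForceASCIIRangeMask  ((uint32_t)1 << kUnicodeForceASCIIRangeBit)
#define kUnicodeNoHalfwidthCharsMask ((uint32_t)1 << kUnicodeNoHalfwidthCharsBit)

/* Hypothetical helper: start from the caller's existing flags, always force
   0x00-0x7F to ASCII (so 0x5C stays REVERSE SOLIDUS in a DOS path), and
   optionally make the halfwidth katakana unmappable. */
static uint32_t FlagsForDOSPath(uint32_t baseFlags, int dropHalfwidthKana)
{
    uint32_t flags = baseFlags | kUnicodeForceASCIIRangeMask;
    if (dropHalfwidthKana)
        flags |= kUnicodeNoHalfwidthCharsMask;
    return flags;
}
```

The result would be passed as the iControlFlags parameter of ConvertFromTextToUnicode or ConvertFromUnicodeToText. On TEC versions earlier than 1.4 these bits are not supported, so a caller would first check the feature bit described in 1f.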
f) Defined new feature/fix bits (and corresponding masks) for the tecUnicodeConverterFeatures field of the TECInfo structure returned by TECGetInfo, to indicate new bug fixes/enhancements in TEC 1.4. These are (TextCommon.h):
- kTECAddForceASCIIChangesBit: Support new control flag bits kUnicodeForceASCIIRangeBit and kUnicodeNoHalfwidthCharsBit for use with ConvertFromTextToUnicode, ConvertFromUnicodeToText, etc. See sections 1e above and 3a below.
- kTECPreferredEncodingFixBit: CreateUnicodeToTextRunInfo and related functions fix a problem that occurred when a preferred encoding was specified that did not match the System script; the preferred script was not actually placed first in the ordered list of encodings to use. See section 2c below.
    kTECAddForceASCIIChangesBit  = 4
    kTECPreferredEncodingFixBit  = 5
    kTECAddForceASCIIChangesMask = 1L << kTECAddForceASCIIChangesBit
    kTECPreferredEncodingFixMask = 1L << kTECPreferredEncodingFixBit
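A caller might test these feature bits roughly as follows. This is a sketch only: the `features` argument stands in for the tecUnicodeConverterFeatures field obtained through TECGetInfo, the helper names are hypothetical, and the constants are redefined locally from the values above.

```c
#include <assert.h>
#include <stdint.h>

/* Bit numbers as listed in this note (TextCommon.h); redefined locally. */
enum {
    kTECAddForceASCIIChangesBit = 4,
    kTECPreferredEncodingFixBit = 5
};
#define kTECAddForceASCIIChangesMask ((uint32_t)1 << kTECAddForceASCIIChangesBit)
#define kTECPreferredEncodingFixMask ((uint32_t)1 << kTECPreferredEncodingFixBit)

/* Hypothetical helper: `features` stands in for the value of the
   tecUnicodeConverterFeatures field of a TECInfo structure. */
static int SupportsForceASCIIFlags(uint32_t features)
{
    return (features & kTECAddForceASCIIChangesMask) != 0;
}

/* Hypothetical helper: reports whether the preferred-encoding fix is present. */
static int HasPreferredEncodingFix(uint32_t features)
{
    return (features & kTECPreferredEncodingFixMask) != 0;
}
```

Testing the feature bit rather than the TEC version number keeps the check valid if a later release carries the same fix.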
g) Added TextCommon.r (containing the TextEncoding constants as #defines) and UnicodeConverter.r (containing the conversion control flag constants as #defines).
h) Made the stub libraries FAT. Before TEC 1.4 they were PPC only, even though the TEC implementation shared libraries are FAT (supporting PPC and CFM 68K).
2. Implementation bug fixes
a) When HFS Extended, UDF/DVD, or PC Exchange were used, the TEC functions ConvertFromUnicodeToTextRun and ConvertFromUnicodeToScriptCodeRun would dereference a null handle, eventually causing a crash. In addition, in rare circumstances it was possible for TEC to make Memory Manager calls at interrupt time, resulting in HFS Extended catalog corruption and/or incorrect error values returned from application Memory Manager calls. This has been fixed (#2251442).
b) If the boot volume containing the Text Encoding Converter was in HFS Extended format and the name of the TEC extension or the names of any Text Encodings files were localized into non-ASCII characters, the TEC tables in those files were inaccessible. During booting, HFS Extended mangles non-ASCII names; these must be unmangled to access files after TEC has been loaded. (#2216745)
Note: There is a related fix that was made in the File Manager after Mac OS 8.1 was released (this fix is in Mac OS 8.5, and in some localized versions of Mac OS 8.1). Both changes are necessary to fix this problem. In particular, the problem is still present with TEC 1.4 on U.S. versions of Mac OS 8.1.
c) If CreateUnicodeToTextRunInfo was called with iNumberOfMappings=-1 and a preferred mapping that was different from the system script, the preferred mapping was often ignored and the system script used instead. (#2215984)
d) For ConvertFromUnicodeToText[Run]: If an undefined Unicode is not the first character in a text element, then it should just terminate the text element; the functions should not return kTextUndefinedElementErr unless the undefined Unicode begins a text element. (#2212628)
e) If an odd value was passed as the iUnicodeLen parameter to ConvertFromUnicodeToText for UCS-2/UTF-16 text, it attempted to convert the partial Unicode character (and claimed that it read one byte beyond the length passed in iUnicodeLen). It now returns kTECPartialCharErr and only converts through the last complete character. Note that iUnicodeLen can legitimately be odd for UTF-8 text. (#2212626)
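The "last complete character" boundary for an odd byte count can be sketched as below. This is not TEC code; the helper names are hypothetical, and the sketch considers only 16-bit code units, not surrogate pairs or text elements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes in `byteLen` that form whole
   16-bit code units (UCS-2/UTF-16). */
static size_t CompleteUTF16ByteLength(size_t byteLen)
{
    return byteLen & ~(size_t)1;
}

/* Hypothetical helper: 1 if a trailing half code unit would be left over. */
static int HasPartialUTF16Unit(size_t byteLen)
{
    return (byteLen & 1) != 0;
}
```

For a 7-byte UTF-16 buffer, the complete portion is 6 bytes and one byte is left over, which corresponds to the case where TEC 1.4 returns kTECPartialCharErr. The same check should not be applied to UTF-8 text, where odd lengths are legitimate.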
f) Preserve current resource file across TextCommon cfrg initialization (and across InitializeUnicodeConverter). (part of #2203534)
g) TECConvertText could hang when processing EUC-JP if it encountered a partial character (e.g. just 0x8F) at the end of a buffer. (#2219197)
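To make the partial-character case concrete, the following sketch finds how much of an EUC-JP buffer consists of complete characters. It is an illustration, not the TEC implementation. It relies on the standard EUC-JP layout (0x8E introduces a 2-byte halfwidth katakana, 0x8F introduces a 3-byte JIS X0212 character, 0xA1-0xFE begins a 2-byte JIS X0208 character) and does not validate trail bytes. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Byte length of an EUC-JP character, judged from its lead byte alone. */
static size_t EUCJPCharLength(unsigned char lead)
{
    if (lead == 0x8E) return 2;                 /* halfwidth katakana (JIS X0201) */
    if (lead == 0x8F) return 3;                 /* JIS X0212 */
    if (lead >= 0xA1 && lead <= 0xFE) return 2; /* JIS X0208 */
    return 1;                                   /* ASCII range, or an invalid byte */
}

/* Returns the number of leading bytes of buf that form complete characters.
   A result smaller than len means the buffer ends in a partial character. */
static size_t EUCJPCompletePrefixLength(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    assert(buf != NULL || len == 0);
    while (i < len) {
        size_t need = EUCJPCharLength(buf[i]);
        if (need > len - i)
            break;          /* the last character is cut off by the buffer end */
        i += need;
    }
    return i;
}
```

A buffer ending in a lone 0x8F yields a prefix length one byte short of the buffer length. A caller that feeds text in chunks could carry the remaining bytes over to the next chunk.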
h) TECSniffTextEncoding did not record errors for invalid single-byte characters in ISO-2022-JP. (#2218976)
i) This was the single bug fix in TEC 1.3.1: TECConvertText was writing over low memory (e.g. address 8). A structure was being initialized before the pointer used to access it was initialized (the pointer was NULL). (#2203186)
3. Implementation enhancements and changes
a) Implement new UnicodeConverter options kUnicodeForceASCIIRangeBit and kUnicodeNoHalfwidthCharsBit. These are used for the iControlFlags parameter of ConvertFromTextToUnicode, ConvertFromUnicodeToText, etc. See item 1e above. (#1644700)
Implementing these options entailed rearranging some mapping tables and changing some mappings; these mapping changes (and others) are described in section 4.
b) Upgrade to support Unicode 2.1 (which becomes the default Unicode version). Unicode 2.1 adds the EURO SIGN and OBJECT REPLACEMENT CHARACTER, makes some changes to direction classes, and has a few other changes that do not affect TEC operation. (part of #2203409, #2237762)
c) Support the addition of EURO SIGN to Mac OS Roman and Mac OS Symbol. These encodings, and the fonts associated with them that Apple ships, are changing for Mac OS 8.5 to support the new EURO SIGN character.
Mac OS Roman had no unassigned code points, so code point 0xDB, which was formerly CURRENCY SIGN, has been reassigned as EURO SIGN. TEC handles this as follows (#2203409, #2257036):
- For kTextEncodingMacRoman, there are now two variants: kMacRomanStandardVariant (=kTextEncodingDefaultVariant) and kMacRomanCurrencySignVariant. Previously there was only kTextEncodingDefaultVariant for kTextEncodingMacRoman.
- For kMacRomanStandardVariant, 0xDB maps to Unicode 0x20AC (EURO SIGN) when mapping to Unicode 2.0 or 2.1, and to private-use Unicode 0xF8A0 when mapping to Unicode 1.1 (since 0x20AC is not defined for Unicode 1.1).
- For kMacRomanCurrencySignVariant, 0xDB maps to Unicode 0x00A4 (CURRENCY SIGN). This is the mapping that was used for MacRoman DefaultVariant in TEC versions earlier than 1.4.
- For kTextEncodingMacRoman, UpgradeScriptInfoToTextEncoding will choose kMacRomanCurrencySignVariant when running on systems earlier than Mac OS 8.5, and will choose kMacRomanStandardVariant when running on Mac OS 8.5 and later. If your conversion needs finer control and must depend on whether a particular MacRoman font supports EURO SIGN or CURRENCY SIGN, check with Apple's font group for recommendations on how to determine this.
Mac OS Symbol had several unassigned code points. One of these, 0xA0, is now assigned EURO SIGN. So the mapping for kTextEncodingMacSymbol is changed as follows: 0xA0 now maps to Unicode 0x20AC (EURO SIGN) when mapping to Unicode 2.0 or 2.1, and to private-use Unicode 0xF8A0 when mapping to Unicode 1.1 (as above). (#2256792)
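The mapping rules in 3c can be summarized in a small sketch. The Unicode values are the ones stated above; the enum and function names are hypothetical stand-ins for illustration and do not correspond to TEC interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the variant and the target Unicode version. */
typedef enum { kSketchStandardVariant, kSketchCurrencySignVariant } SketchRomanVariant;
typedef enum { kSketchUnicode1_1, kSketchUnicode2_0, kSketchUnicode2_1 } SketchUnicodeVersion;

/* Unicode value for Mac OS Roman 0xDB, following the rules in this note. */
static uint16_t MacRomanDBToUnicode(SketchRomanVariant variant,
                                    SketchUnicodeVersion version)
{
    if (variant == kSketchCurrencySignVariant)
        return 0x00A4;                        /* CURRENCY SIGN */
    if (version == kSketchUnicode1_1)
        return 0xF8A0;                        /* private use: no EURO SIGN in 1.1 */
    return 0x20AC;                            /* EURO SIGN */
}

/* Unicode value for Mac OS Symbol 0xA0, which has no currency-sign variant. */
static uint16_t MacSymbolA0ToUnicode(SketchUnicodeVersion version)
{
    return (version == kSketchUnicode1_1) ? 0xF8A0 : 0x20AC;
}
```

Only the standard variant depends on the target Unicode version; the currency sign variant maps to 0x00A4 in every case described here.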
d) Add support for the following encodings (#2227042):
- kTextEncodingDOSLatin1 (DOS code page 850, "multilingual")
- kTextEncodingDOSThai (Windows/DOS code page 874; based on TIS 620-2533)
- kTextEncodingDOSJapanese (Windows/DOS code page 932; based on Shift-JIS)
- kTextEncodingDOSChineseSimplif (Windows/DOS code page 936; equivalent to GBK)
- kTextEncodingDOSChineseTrad (Windows/DOS code page 950; based on Big-5)
e) Add Unicode Converter support for EUC-JP. In previous TEC versions, the high-level Text Encoding Converter provided algorithmic conversion between EUC-JP and ISO 2022-JP or Shift-JIS/MacJapanese. However, this resulted in loss of the JIS X0212 characters. Now all of the EUC-JP characters, including the JIS X0212 characters, can be converted to and from Unicode. The new options kUnicodeForceASCIIRangeBit and kUnicodeNoHalfwidthCharsBit (see 3a above) can be used with EUC-JP. (#1643497)
f) Support 4-byte codeset 2 characters in EUC-TW. In previous TEC versions, ConvertFromTextToUnicode supported only the 1-byte codeset 0 and 2-byte codeset 1 characters in EUC-TW. ConvertFromUnicodeToText[Run] could map from Unicode to 4-byte characters in EUC-TW for codeset 2, planes 2 and 3, but only as loose mappings (since the 4-byte EUC-TW characters could not be mapped back to Unicode). (#1634875)
In TEC 1.4, the 4-byte characters for EUC-TW codeset 2, planes 2 and 3, can be mapped to or from Unicode as strict mappings. The 4-byte characters for EUC-TW codeset 2, plane 1, can only be mapped to Unicode as loose mappings, since the resulting Unicode characters are mapped back to EUC-TW as 2-byte codeset 1 characters (CNS plane 1 is redundantly encoded in EUC-TW).
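The plane distinction can be sketched as follows. The byte layout used here comes from the EUC-TW definition rather than from this note: a codeset 2 character is 0x8E, then a plane byte 0xA1-0xB0 (planes 1 through 16), then two more bytes. The type and function names are hypothetical, and planes other than 1 to 3 are reported as "not described" because this note says nothing about them.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    kSketchNotCodeset2,      /* not a 4-byte codeset 2 sequence */
    kSketchLooseToUnicode,   /* plane 1: maps to Unicode only as a loose mapping */
    kSketchStrict,           /* planes 2 and 3: strict in both directions */
    kSketchNotDescribed      /* other planes: not covered by this note */
} SketchEUCTWMapping;

/* Classify a 4-byte EUC-TW sequence by its CNS plane, per section 3f. */
static SketchEUCTWMapping ClassifyEUCTW4Byte(const unsigned char *p, size_t len)
{
    unsigned plane;
    assert(p != NULL || len == 0);
    if (len < 4 || p[0] != 0x8E || p[1] < 0xA1 || p[1] > 0xB0)
        return kSketchNotCodeset2;
    plane = (unsigned)(p[1] - 0xA0);          /* 0xA1 -> plane 1, 0xA2 -> plane 2 */
    if (plane == 1)
        return kSketchLooseToUnicode;
    if (plane == 2 || plane == 3)
        return kSketchStrict;
    return kSketchNotDescribed;
}
```

Plane 1 is loose because the same characters also have a 2-byte codeset 1 form, which is the form TEC produces when mapping back from Unicode.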
g) GetTextEncodingName can now return many different localized names for any supported encoding (#2234345). The following localized versions are available for all encoding names (additional localized versions, such as verChina/kTextEncodingMacChineseSimp, are available for certain names):
- verUS kTextEncodingUS_ASCII (subset of kTextEncodingMacRoman)
- verBritain kTextEncodingUS_ASCII (subset of kTextEncodingMacRoman)
- verFrance kTextEncodingMacRoman
- verFrCanada kTextEncodingMacRoman
- verFrSwiss kTextEncodingMacRoman
- verFrenchUniversal kTextEncodingMacRoman
- verGermany kTextEncodingMacRoman
- verItaly kTextEncodingMacRoman
- verNetherlands kTextEncodingMacRoman
- verSweden kTextEncodingMacRoman
- verSpain kTextEncodingMacRoman
- verPortugal kTextEncodingMacRoman
- verNorway kTextEncodingMacRoman
- verFinland kTextEncodingMacRoman
- verTurkey kTextEncodingMacTurkish
- verCzech kTextEncodingMacCentralEurRoman
- verJapan kTextEncodingMacJapanese
- verKorea kTextEncodingMacKorean
As part of this, the names that previously included "Unicode 2.0" now have "Unicode 2.1".
Note that the names returned by GetTextEncodingName may contain parentheses, and so cannot be used with AppendMenu or InsertMenuItem. They can, however, be used with SetMenuItemText. This is not a change with TEC 1.4, but it was not previously mentioned in the TEC documentation (it will be for TEC 1.4).
h) ResolveDefaultTextEncoding maps kTextEncodingMacUnicode (see 1b above) to kTextEncodingUnicodeV2_1. CreateTextToUnicodeInfo, CreateUnicodeToTextInfo, and other functions call ResolveDefaultTextEncoding and thus handle kTextEncodingMacUnicode. (#2254112)
i) Script.h (and Script.r, etc.) was brought up to date with many new language and region codes for Universal Interfaces 3.1, with additional updates for Universal Interfaces 3.2. UpgradeScriptInfoToTextEncoding and RevertTextEncodingToScriptInfo have been updated for TEC 1.4 to handle the new language and region codes. (#2203406)
j) Updated TECGetInfo to set the new feature bits described in section 1f above. (#2252655)
k) There were several changes to the Internet name mappings for TECGetTextEncodingFromInternetName and TECGetTextEncodingInternetName:
- Make "KSC5601" map to kTextEncodingEUC_KR. (#2215900)
- Make "GB_2312-80" map to kTextEncodingEUC_CN (de facto usage) instead of kTextEncodingGB_2312_80 (IANA definition); now kTextEncodingGB_2312_80 maps to "csISO58GB231280" (IANA alias) and back. (#2215899)
- Make "KS_C_5601-1987" map to kTextEncodingEUC_KR (de facto usage) instead of kTextEncodingKSC_5601_87 (IANA definition); now kTextEncodingKSC_5601_87 maps to "csKSC56011987" (IANA alias) and back. (#2215899)
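The revised name mappings in 3k can be written out as a lookup table. This sketch does not call TECGetTextEncodingFromInternetName. It records the constant names as strings so that it does not depend on TextCommon.h, and it uses exact string comparison as a simplification. The type, table, and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *internetName;      /* name as it appears in Internet text */
    const char *encodingConstant;  /* TextEncoding constant, kept as a string */
} SketchNameMapping;

/* Internet name to encoding, as described for TEC 1.4 in section 3k. */
static const SketchNameMapping kSketchNameTable[] = {
    { "KSC5601",         "kTextEncodingEUC_KR" },
    { "GB_2312-80",      "kTextEncodingEUC_CN" },       /* de facto usage */
    { "KS_C_5601-1987",  "kTextEncodingEUC_KR" },       /* de facto usage */
    { "csISO58GB231280", "kTextEncodingGB_2312_80" },   /* IANA alias */
    { "csKSC56011987",   "kTextEncodingKSC_5601_87" }   /* IANA alias */
};

/* Returns the constant name for an Internet name, or NULL if not in the table. */
static const char *SketchEncodingForInternetName(const char *name)
{
    size_t i;
    assert(name != NULL);
    for (i = 0; i < sizeof kSketchNameTable / sizeof kSketchNameTable[0]; i++) {
        if (strcmp(kSketchNameTable[i].internetName, name) == 0)
            return kSketchNameTable[i].encodingConstant;
    }
    return NULL;
}
```

The table shows the asymmetry introduced by the change: the common names now resolve to the EUC encodings, while the IANA-defined character sets remain reachable through their "cs" aliases.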
l) Moved as many resources as possible from the Text Encoding Converter extension to the files in the Text Encodings folder, leaving only the resources needed to support mapping between Mac OS encodings and the kUnicodeCanonicalDecompVariant variant of kTextEncodingUnicodeV2_1. Resources that were moved include:
- The resources needed by GetTextEncodingName for all non-Unicode encodings
- The resources needed to support mapping between kTextEncodingMacRoman and the standard variant of kTextEncodingUnicodeV2_1.
Functionality that depends on those resources will not work if the TEC 1.4 extension is used with the Text Encodings files from TEC 1.3 (because the necessary resources will be unavailable).
m) The Mac OS 8.5 System file includes PPC versions of the TextCommon and UnicodeConverter shared libraries, as well as the resources necessary to convert between Mac OS encodings and the kUnicodeCanonicalDecompVariant variant of kTextEncodingUnicodeV2_1. These libraries and resources are duplicates of those in the Text Encoding Converter extension file from TEC 1.4.
n) Implementation shared libraries have a current version of 2 for TEC 1.4. (#2252349)
4. Other mapping changes
a) kTextEncodingMacJapanese, kMacJapaneseStandardVariant:
- Change strict mapping of 0x8650 from Unicodes 0xFF4C+0xF87F to 0x2113. The old mapping from Unicodes 0xFF4C+0xF87F to Mac OS Japanese 0x8650 is preserved as a loose mapping. (#2247681)
- Change strict mapping of 0x8855 from Unicode 0x301E to 0x301F; this results from a clarification in the Unicode Standard about Unicodes 0x301E and 0x301F. The old mapping from Unicode 0x301E to Mac OS Japanese 0x8855 is preserved as a loose mapping. (#2234348)
b) kTextEncodingMacJapanese, kMacJapanesePostScriptScrnVariant (used for fonts SaiMincho and ChuGothic): Add missing mappings for characters in the range 0x86A2-0x879C. Before TEC 1.4, these mappings were only for kMacJapanesePostScriptPrintVariant. In TEC 1.4 they are added for both kMacJapanesePostScriptScrnVariant and kTextEncodingDOSJapanese. Some of the characters in this range are duplicates of standard Shift-JIS characters so the mappings are changed in TEC 1.4 to ensure roundtrip fidelity (which is not required for kMacJapanesePostScriptPrintVariant), as follows (#2245742):
- Use transcoding hints for all the mappings in 0x86A2-0x879C that would otherwise duplicate mappings of standard Shift-JIS characters. Mappings affected are for characters in the ranges 0x8790-0x8792, 0x8794-0x8797, 0x879A-0x879C.
- Change mapping of 0x8794 from Unicode 0x03A3 to 0x2211.
- Change mapping of 0x8642 for PostScript variants from 0x301E to 0x301F (as in 4a).
c) kTextEncodingMacChineseSimp
- Fix incorrect loose mapping from Unicode 0x2022; was mapping to 0xA145, now maps to 0xA1A4. (#2216804)
d) kTextEncodingShiftJIS
- Change mapping of 0x815F from Unicode 0x005C to 0xFF3C. (part of #1644700)
e) kTextEncodingBig5
- Make 0x83-0x9F unmappable instead of mapping them to Unicode C1 controls 0x0083-0x009F. (part of #2227042)
f) kTextEncodingGBK_95
- Make 0x80 unmappable instead of mapping it to Unicode C1 control 0x0080. (part of #2227042)
g) Updated the mapping tables for Windows 8-bit codepages to reflect the addition of EURO SIGN, and other recent additions per Windows mapping tables posted at <ftp.unicode.org> (#2237762). The new code points are as follows:
- kTextEncodingWindowsLatin1: 0x80 (EURO), 0x8E, 0x9E.
- kTextEncodingWindowsLatin2: 0x80 (EURO), 0xAC.
- kTextEncodingWindowsCyrillic: 0x88 (EURO).
- kTextEncodingWindowsGreek: 0x80 (EURO).
- kTextEncodingWindowsLatin5: 0x80 (EURO).
- kTextEncodingWindowsHebrew: 0x80 (EURO), 0xA1, 0xAA, 0xB8, 0xBA, 0xBF, 0xD7, 0xD8.
- kTextEncodingWindowsArabic: 0x80 (EURO), 0x81, 0x8D, 0x8E, 0x90. Also make the following undefined instead of mapping to Unicode C1 controls or private-use characters: 0x8A, 0x8F, 0x98, 0x9A, 0x9F, 0xAA, 0xC0, 0xFF.
- # -